Modeling Molecular Mass
2025-08-01
\[\gamma_{API}=\frac{141.5}{\gamma_{o}}-131.5\]
| Symbol | Meaning | Units |
|---|---|---|
| \(\gamma_{API}\) | API Gravity | [degAPI] |
| \(\gamma_o\) | specific gravity | [1/wtr] |
from pysr import PySRRegressor()
myMod=PySRRegressor()
myMod.fit(x.reshape(-1, 1),y)
y_pred=myMod.predict(x.reshape(-1, 1))
myEq=myMod.sympy()
\[ x_0-(x_0+0.013196754)+1.0131962+ \frac{x_0 (-132.5)- -141.5}{x_0} \]
sym.simplify(myEq)
\[-131.500000554+\frac{141.5}{x_0}\]
w=(random.rand(21)-0.5)*2.5
\[-132.05688+\frac{141.88547}{x_0}\]
binary_operators=["+","-","*","/"]
unary_operators=None
maxsize=30
| 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|
| minus | divide | 141.5 | specific gravity | 131.5 |
maxdepth=None
elemtwise_loss="L2DistLoss()"
\[ \frac{1}{n} \sum_{k=1}^{n}{\left(\hat{y}_k-y_k \right)}^2\]
\[ \frac{1}{n} \sum_{k=1}^{n}{\left|\hat{y}_k-y_k \right|}\]
parsimony=0.0
model_selection="best"
should_simplify=True
Tournament_selection_n=15
niterations=100
max_evals=None
timeout_in_seconds=None
populations=31
population_size=27
\[ \frac{a_{00}}{x}+a_{01}\]
w<-seq(from=121.5,to=161.5, by=10)
<expr> ::= <op>(<expr>, <expr>) | <var> | <con>
<op> ::= "+" | "-" | "*" | "/"
<var> ::= x
<con> ::= 121.5 | 131.5 | 141.5 | 151.5 | 161.5
Grammatical Evolution Search Results:
No. Generations: 1531
Best Expression: 141.5/x - 131.5
Best Cost: 0
| Variable | Description | Designation |
|---|---|---|
| \(Mw\) | Molecular Mass | dependent |
| \(SG\) | Specific Gravity | independent |
| \(TBP\) | True Boiling Point | independent |
SG TBP MW
Min. :0.6310 Min. : 306.0 Min. : 70.0
1st Qu.:0.7439 1st Qu.: 373.2 1st Qu.: 99.0
Median :0.8474 Median : 584.0 Median : 222.5
Mean :0.8470 Mean : 575.2 Mean : 304.6
3rd Qu.:0.8784 3rd Qu.: 668.5 3rd Qu.: 297.8
Max. :1.3793 Max. :1012.0 Max. :1685.0
Raw Molecular Mass Histogram
Raw Molecular Mass Box-and-Whiskers Plot
Raw Boiling Point Box-and-Whiskers Plot
Raw Boiling Point Histogram
Raw Specific Gravity Histogram
Raw Specific Gravity Box-and-Whiskers Plot
Molecular Mass vs Specific Gravity Goossens Data
Molecular Mass vs Boiling Point Goossens Data
Specific Gravity vs Boiling Point Goossens Data
Molecular Mass vs Boiling Point. Both Datasets
Molecular Mass vs Specific Gravity. Both Datasets
| Symbol | Meaning |
|---|---|
| \(M_w\) | Apparent Molecular Mass |
| \(T_b\) | True Boiling Point Temperature |
| \(\gamma_o\) | Specific Gravity |
| \(a_{00}..a_{09}\) | Empirical Constants |
| \(K_w\) | Characterization Factor (intermediate value) |
| \(X_0...X_3\) | Intermediate Variables |
Hariu & Sage (1969)
\[ M_w = a_{00} + a_{01} K_w + a_{02} K_w^2 + a_{03} T_b K_w + a_04 T_b K_w^2 + a_{05} T_b^2 K_w + a_{06} T_b^2 K_w^2 \]
\[K_w =\frac{\sqrt[3]T_b}{\gamma_o}\]
Kesler & Lee (1976)
\[M_w = X_0 + \frac{X_1}{T_b} + \frac{X_2}{T_b^2}\]
\[X_0 = a_{00} + a_{01} γ_o+ \left (a_{02} + a_{03} γ_o \right ) T_b\]
\[ X_1 = \left (1+ a_{04} γ_o + a_{05}γ_o^2 \right ) \left (a_{06} + \frac{a_ {07}}{T_b} \right ) \cdot 10^7 \]
\[ X_2 = \left (1+ a_{08} γ_o+ a_{09} γ_o^2 \right ) \left (a_{10} + \frac{a_{11}}{T_b} \right ) \cdot 10^{12} \]
American Petroleum Institute (1977)
\[ M_w = a_{00} e^ {\left (a_{01} T_b \right )} e^{\left (a_{02} γ_o \right )} T_b^{a_{03}} γ_o^ {a_{04}} \]
Winn, Sim & Daubert (1980)
\[M_w = a_{00} T_b^{a_ {01}} γ_o^{a_{02}}\]
Riazi & Daubert (1980)
\[M_w = a_{00} T_b^{a_ {01}}γ_o^{a_{02}}\]
Rao & Bardon (1985)
\[ln {M_w} = (a_{00} + a_{01} K_w) ln (\frac{T_b} {a_{02} + a_{03} K_w} )\]
Riazi & Daubert (1987)
\[ M_w = a_{00} T_b^{a_{01}} γ_o^{a_{02}} e^{\left (a_{03} T_b + a_{04} γ_o + a_{05} T_b γ \right )} \]
Goossens (1996)
\[M_w = a_{00} T_b^{X_0}\]
\[ X_0 =\frac {a_{03} + a_{04} ln {\left (\frac{T_b} {a_{05} - T_b} \right )}} {a_{01} γ_o + a_{02}} \]
Linan (2011)
\[ M_w = a_{00} e^{\left (a_{01} T_b \right )} e^{\left (a_{02} γ_o \right )} T_b^ {a_{03}} γ_o^{a_{04}} \]
Hosseinifar & Shahverdi (2021)
\[M_w = {\left [a_{00} T_b^{a_{01}} {\left (\frac{3+2γ_o} {3-γ_o} \right )}^{\frac{a_{02}}{2}} + a_{03} T_b^{a_{04}} {\left (\frac{3+2γ_o}{3-γ_o} \right )}^{\frac{a_{05}}{2}} \right ]}^{a_{06}}\]
Stratiev (2023)
\[ M_w = a_{00} + a_{01} e^{\left [a_{02} e^{\left (a_{03} \frac{T_b^{a_{06}}}{γ_o^{a_{05}}} \right )} \right ]} \]
| Operator | Type | Description |
|---|---|---|
| pow | binary | one expression raised to the power of another |
| log | unary | logarithm of an expression |
| exp | unary | antilogarithm of an expression |
| sqr | unary | expression squared |
| cub | unary | expression cubed |
| inv | unary | inverse of an expression |
\[ M_w=a_{00}+\frac{a_{01}}{\gamma_o-a_{02}}+\frac{T_b\cdot (a_{03}\cdot T_b-a_{04})}{\gamma_o\cdot (a_{05}-\frac{a_{06}}{T_b})} \]
| Raw MW | Goossens Equation |
This Equation |
|
|---|---|---|---|
| Raw MW | 1.000000 | 0.999711 | 0.999847 |
| Goossens Equation | 0.999711 | 1.000000 | 0.999798 |
| This Equation | 0.999847 | 0.999798 | 1.000000 |
\[M_w=a_{00} \cdot a_{01}^{\gamma_o^{-a_{02}} + a_{03} \cdot T_b} + a_{04}\]
| Raw MW | This Equation | |
|---|---|---|
| Raw MW | 1.000000 | 0.997281 |
| This Equation | 0.997281 | 1.000000 |
\[ M_w= - T_b \cdot \left(a_{00} \cdot T_b - a_{01}\right) \left(a_{02} \cdot 10^{-6} \left(a_{03} \cdot \gamma_o - 2 \cdot T_b \right) \left(T_b - a_{04}\right) - 1\right) + a_{05}\]
\[ M_w= a_{00}\cdot T_b + a_{01} \cdot e^{- \gamma_o^{2} + a_{02}\cdot \gamma_o + a_{03} \cdot T_b}\]
| Raw MW | 1st Equation |
2nd Equation |
|
|---|---|---|---|
| Raw MW | 1.000000 | 0.997705 | 0.998420 |
| First Equation | 0.997705 | 1.000000 | 0.999497 |
| Second Equation | 0.998420 | 0.999497 | 1.000000 |
Raw Data Scale Change
\[ M_w=(a_{00}\cdot T_b-a_{01})e^{a_{02}\cdot 10^{-9}T_b^2(a_{03}\cdot \gamma_0 +a_{04}\cdot T_b)} \]
\[ M_w=-a_{00}\cdot \gamma_o+\frac{e^{-\gamma_o^2+{log(T_b)}^3}}{T_b^{a_{01}}} +a_{02}\cdot T_b \]
| Raw Mw | First Equation | Second Equation | |
|---|---|---|---|
| Raw Mw | 1.000000 | 0.998880 | 0.999324 |
| First Equation | 0.998880 | 1.000000 | 0.998851 |
| Second Equation | 0.999324 | 0.998851 | 1.000000 |
\(\lambda=-0.3624\)
Shapiro-Wilk p-value \(0.0507\)
\[ M_w^t = \left (\gamma_o \left(a_{00} T_b -a_{01} \right) log{(\gamma_o)} + a_{02} \right) log{(T_b)} \]
Reverse the Box-Cox transform:
\[ M_w = {\left(\lambda M_w^t +1 \right)}^{\frac{1}{\lambda}} \]
| Raw Mw | This Equation | |
|---|---|---|
| Raw Mw | 1.000000 | 0.995897 |
| This Equation | 0.995897 | 1.000000 |
\[ M_w=-a_{00} \cdot \gamma_o +\frac{a_{01}\cdot \gamma_o}{T_b}+a_{02}\cdot T_b -a_{03}\]
| Raw Mw | This Equation | |
|---|---|---|
| Raw Mw | 1.000000 | 0.970451 |
| This Equation | 0.970451 | 1.000000 |
\[ M_w=-a_{00}\cdot \gamma_o\cdot T_b +a_{01}\cdot T^2_b +a_{02}\cdot e^{\gamma_o} +a_{03} \qquad(1)\]
| Raw Mw | ||
|---|---|---|
| Raw Mw | 1.000000 | |
| 1st Run Equation | 0.970451 | |
| 2nd Run Equation | 0.985800 |
| Run | Correlation Coefficient |
|---|---|
| Raw Mw | 1.000000 |
| Default | 0.999957 |
| Goossens Correlation | 0.999939 |
| Power | 0.996964 |
| Exponential #2 | 0.998954 |
| Aeon #1 | 0.999973 |
| Aeon #2 | 0.999921 |
| Box-Cox | 0.999696 |
| Sparse #1 | 0.992303 |
| Sparse #2 | 0.996331 |
All of these are very good, with the validation coefficients exceeding the training coefficients. We will resist the temptation to declare victory over overfitting, as the Hosseinifar dataset is far better behaved than the Goossens dataset.
A better approach with the evolutionary model is probably to start increasing parsimony or decreasing maxsize or maxdepth until the correlation coefficients start to deteriorate.
A different approach is warranted with the deterministic models. It is possible that these are underfit, meaning that we can expand the number of unique terms the algorithm initially considers, or the number of terms we want to keep.
The “UOP Characterization Factor” ?@eq-UOP which (at least) two researchers decided was an important correlating variable for molecular mass, was not discovered by our symbolic regression model.
In fairness, however, we dropped the division operator early in our experimentation, and never asked the model to consider a “cube root” unary operator.
Replication is very challenging with these symbolic regression libraries that use evolutionary algorithms. Re-running the same dataset on the same machine with the same libraries produces different equations. In many Scikit-Learn algorithms, answer variation can be minimized by seeding the random number generator the same way each time, but in PySR we also have to turn off the optimizations that allow Julia to run the calculations massively in parallel, which contributes to the criticism that these algorithms are slow.
The deterministic symbolic regression libraries don’t need random number seeds, and the answers stay the same between runs. The equations are definitely more explainable, but appear to be less accurate. Does this accuracy matter? Is it real? The answer to these questions depend on how good we think our measurements are!
In the engineering world, an old trick to speed-up calculations is to use look-up tables instead of functions. The idea is that even though we may know exactly what the governing equations are, producing look up tables from these functions ahead of time and then applying Gaussian interpolation in real time may be faster. This is a very similar concept to the idea of pre-training a transformer network to do symbolic regression.
This paper has shown that high fidelity models can be constructed from “cheap” components (i.e. arithmetic operators). Maybe explainability is overrated, and functions carefully built using symbolic regression may be competitive with old-man Gauss!
If this turns out to be true, then symbolic regression will make the full circle and return to it roots.
In summary, we have explored Symbolic Regression using evolutionary and deterministic algorithms on our molecular mass dataset. We have not spent much time on different kinds of engines (e.g. neural networks, mathematical programming) or tested the extrapolation chops of this technology.
Our main conclusions are:
Symbolic Regression can generate reasonable predictions when trained on datasets too small for neural networks.
Evolutionary Symbolic Regression will typically generate novel equations from the human point of view, but these equations can be as accurate than the ones created by humans.
The “explainability” of the generated equations are probably overrated, unless enough guardrails are put up to constrain the space searched by the evolutionary Symbolic Regression algorithm.
The ability of Evolutionary Symbolic Regression to easily find alternative equations of about the same accuracy means that humans can repeat the workflow until an equation that is more “explainable” than the others is generated.
The structure of an equation generated with symbolic regression will withstand some noise.
Piping the “winning” equation through a symbolic mathematics package to simplify and consolidate it is a good idea. The effect on the deterministic equations is very slight.
This consolidation often introduces relationships between variables and constants that were not specified before the algorithm was started. This can nudge the human to reconsider relationships previously rejected.
All Symbolic Regression libraries using evolutionary algorithms are not created equal. Library comparison is beyond the scope of this report, but we can comment that PySRwas one of the better ones.
Running both evolutionary and deterministic models on the same problem is also a good idea.